Electrical power plants need a fault detection system that operates in the minimum possible time, both to protect equipment and to keep the system stable. To this end, a power system was designed in MATLAB by robotics engineers, the circuit was simulated under normal and faulty conditions, and the results were recorded to a dataset, available at: https://www.kaggle.com/esathyaprakash/electrical-fault-detection-and-classification?select=detect_dataset.csv. [1]
The following table shows the simulation instances, where the voltage and current values for phases A, B, and C are measured as Va, Vb, Vc, Ia, Ib, and Ic, respectively, and the Output field indicates whether the system is faulty. As can be inferred, this is a binary classification problem, i.e., the system is either faulty (Output = 1) or not (Output = 0).
head(data)
## Output..S. Ia Ib Ic Va Vb Vc
## 1 0 -170.47220 9.2196135 161.25258 0.0544900 -0.6599209 0.6054309
## 2 0 -122.23575 6.1686674 116.06709 0.1020000 -0.6286115 0.5262016
## 3 0 -90.16147 3.8136322 86.34784 0.1410255 -0.6052769 0.4642513
## 4 0 -79.90492 2.3988035 77.50611 0.1562725 -0.6022353 0.4459629
## 5 0 -63.88525 0.5906674 63.29459 0.1804515 -0.5915014 0.4110499
## 6 0 -55.95468 -1.0018817 56.95656 0.1934141 -0.5906954 0.3972813
faulty <- subset(data, data$Output..S. == 1)
no_fault <- subset(data, data$Output..S. == 0)
cat("The number of instances with no fault: ", length(no_fault$Output..S.))
## The number of instances with no fault: 6505
cat("The number of instances with fault: ", length(faulty$Output..S.))
## The number of instances with fault: 5496
The following plots show the effect of change in inputs (“Ia”, “Ib”, “Ic”, “Va”, “Vb”, “Vc”) on the output (faultiness).
Judging by the voltage values alone, it is hard to draw conclusions about faultiness: the non-faulty cases span a wider range of voltages that contains the range of the faulty ones. Still, voltage measurements falling into the narrower range associated with faulty outcomes, although it overlaps with the non-faulty range, can be considered the "risky" ones. The range of current values for non-faulty outputs, in contrast, is far narrower than that of the faulty ones; current values falling into the areas not covered by the non-faulty range are therefore likely to indicate a fault.
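This contrast can be backed up numerically with a small helper of my own (`range_by_class` is not from the original analysis; it assumes `data` is loaded as above):

```r
# Range of a given input column within each output class.
range_by_class <- function(df, col) {
  tapply(df[[col]], df$Output..S., range)
}
```

For the dataset above, `range_by_class(data, "Ia")` and `range_by_class(data, "Va")` make the difference in spread between the two classes explicit.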
The following code snippet produces the principal components in R.
my_pc <- prcomp(data[,2:7], scale = TRUE)
print(my_pc)
## Standard deviations (1, .., p=6):
## [1] 1.308099e+00 1.215088e+00 1.193685e+00 1.098293e+00 4.258003e-01
## [6] 6.153101e-06
##
## Rotation (n x k) = (6 x 6):
## PC1 PC2 PC3 PC4 PC5 PC6
## Ia -0.3214696 0.05631243 0.19291470 -0.7715964 -0.510792617 5.430812e-08
## Ib 0.4815361 0.09135445 0.51824293 0.3409713 -0.612323584 2.485886e-08
## Ic -0.3709766 -0.17574352 -0.57681925 0.3881147 -0.590031399 2.355630e-08
## Va -0.1526212 0.79767914 -0.07124924 0.1056383 -0.002491956 -5.693616e-01
## Vb 0.5660423 -0.30660070 -0.38038483 -0.2956138 -0.087155015 -5.881462e-01
## Vc -0.4283216 -0.47676147 0.46012896 0.1979837 0.091714084 -5.743791e-01
summary(my_pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.3081 1.2151 1.1937 1.0983 0.42580 6.153e-06
## Proportion of Variance 0.2852 0.2461 0.2375 0.2010 0.03022 0.000e+00
## Cumulative Proportion 0.2852 0.5313 0.7687 0.9698 1.00000 1.000e+00
From the previous table, which shows the weights of the original features in the linear combinations that produce the principal components, one can see that the largest weights come from the voltage values for the leading components, i.e., Vb for PC1 and Va for PC2 (for PC3, the currents Ib and Ic carry the largest weights). This is consistent with the comment made in the conclusion of the first step, i.e., the measured voltage values correspond to a greater variance.
The following scree plot shows the "importance" of the components, i.e., how informative they are in representing the majority of the instances (the corresponding proportions of variance are given in the summary table above).
screeplot(my_pc)
As can be seen from the preceding graph, the first four components are the most "important" ones, as they carry most of the proportion of variance. Hence the first two components alone do not suffice to represent the whole data, and I do not expect to see fully distinct clusters when the data is visualized in 2D. For visualization, however, I am limited to trying 2D and 3D.
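The same reading can be automated from the component standard deviations; the helper and its 95% threshold below are my own illustrative choices, not part of the original analysis:

```r
# Smallest number of components whose cumulative proportion of variance
# reaches a given threshold.
n_components <- function(sdev, threshold = 0.95) {
  cum_var <- cumsum(sdev^2) / sum(sdev^2)
  which(cum_var >= threshold)[1]
}
```

For the summary above, `n_components(my_pc$sdev)` returns 4, matching the reading of the scree plot.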
The following figure is the distribution of data points based on the first two principal components.
library(ggfortify) # autoplot() for prcomp objects is provided by ggfortify
ggplot2::autoplot(my_pc, data = data, colour = 'Output..S.')
Although I was not expecting distinct clusters in 2D, i.e., using only the first two principal components, the points do seem to form clusters. One can observe that the non-faulty instances (represented by the darker blue dots) and the rest form separate groups, with a structure that looks three-dimensional even though the graph is in 2D. The graph should therefore become more meaningful when plotted in 3D; the interactive 3D plot can be seen below.
PC <- my_pc$x   # principal component scores
PC1 <- PC[, 1]
PC2 <- PC[, 2]
PC3 <- PC[, 3]
library(rgl)
cols = factor(data$Output..S., labels = c("red", "black"))
plot3d(x=PC1, y=PC2, z=PC3, col=cols)
legend3d("topright", legend = paste('Type: ', c('Non-Faulty', 'Faulty')), pch = 16, col=c("red", "black"), cex=1, inset=c(0.02))
rglwidget()
As can be seen from both the 2D and 3D plots, the non-faulty instances (red dots) form a much more distinct cluster. This is also consistent with the previous discussion, strengthening the evidence for it.
To apply MDS, we first need a distance/proximity/(dis)similarity matrix, computed as follows:
(Due to memory constraints, I had to sample the dataset by a ratio of 1:10).
data_sampled <- data[sample(1:nrow(data), 1000), ]
# Distances are computed on the six input features only, excluding the label.
dist_n <- dist(data_sampled[, 2:7], method = 'euclidean')
Based on the previous analysis, I plotted the rest of the graphs in 3D for meaningful visualization purposes.
MDS transforms the dissimilarity matrix into disparities so as to minimize Kruskal's stress function, i.e., to optimally scale the dissimilarity values. There are multiple transformation functions provided in the smacof package. [2] Here, I applied three of them, namely ratio MDS, interval MDS, and monotone spline MDS, each of which provides a different level of flexibility in transforming dissimilarities into scaled distances.
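As a minimal sketch of the quantity being minimized (for ratio MDS, taking the disparities equal to the dissimilarities; the function name is my own):

```r
# Kruskal's stress-1 for a configuration X against dissimilarities delta.
stress1 <- function(delta, X) {
  d <- as.matrix(dist(X))  # distances in the low-dimensional configuration
  delta <- as.matrix(delta)
  sqrt(sum((delta - d)^2) / sum(d^2))
}
```

A perfect embedding gives stress 0; smacof's `mds()` reports its normalized stress in `mds3$stress`.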
library(smacof) # mds()
cols = factor(data_sampled$Output..S., labels = c("red", "black"))
mds3 = mds(delta = dist_n, ndim = 3, type = "ratio")
library(rgl)
plot3d(mds3$conf, col = cols)
legend3d("topright", legend = paste('Type: ', c('Non-Faulty', 'Faulty')), pch = 16, col=c("red", "black"), cex=1, inset=c(0.02))
rglwidget()
mds3 = mds(delta = dist_n, ndim = 3, type = "interval")
library(rgl)
plot3d(mds3$conf, col = cols)
legend3d("topright", legend = paste('Type: ', c('Non-Faulty', 'Faulty')), pch = 16, col=c("red", "black"), cex=1, inset=c(0.02))
rglwidget()
mds3 = mds(delta = dist_n, ndim = 3, type = "mspline")
library(rgl)
plot3d(mds3$conf, col = cols)
legend3d("topright", legend = paste('Type: ', c('Non-Faulty', 'Faulty')), pch = 16, col=c("red", "black"), cex=1, inset=c(0.02))
rglwidget()
Although the data is subsampled, i.e., not all instances are included, it still forms clusters well. Furthermore, the plotted graphs are consistent with one another and with the previous PCA results. All three MDS variants give more or less the same picture, which strengthens the conclusion reached in the previous steps, i.e., the measurements of non-faulty systems form a tighter cluster.
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
data_sampled <- data[sample(1:nrow(data), 1000), ]
true_labels <- factor(data_sampled$Output..S.)
# Normalize each input column separately; applying normalize() to the whole
# data frame would use a single global min/max across all columns.
data_normalized <- as.data.frame(lapply(data_sampled[, 2:ncol(data_sampled)], normalize))
set.seed(123) # To get the same random sample
dat.d <- sample(1:nrow(data_normalized), size = nrow(data_normalized) * 0.7, replace = FALSE) # random selection of 70% of the data
train.data <- data_normalized[dat.d, ]  # 70% training data (input features only)
test.data <- data_normalized[-dat.d, ]  # remaining 30% test data
# Separate vectors for the 'Output..S.' target labels
train.labels <- data_sampled[dat.d, 1]
test.labels <- data_sampled[-dat.d, 1]
library(class)
# find the optimal k by test-set accuracy
k.optm <- numeric(100)
for (i in 1:100){
  knn.mod <- knn(train=train.data, test=test.data, cl=train.labels, k=i)
  k.optm[i] <- 100 * sum(test.labels == knn.mod)/NROW(test.labels)
}
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")
Although k = 1 gives the best accuracy here, choosing such a small k is known to lead to overfitting; that is why k = 10 is chosen.
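With k fixed, the final classifier is obtained with the same `knn()` call as in the loop. Since its output depends on the sampled split, here is a self-contained toy sanity check of the call shape instead (the data and labels below are invented for illustration):

```r
library(class)
# Two well separated groups whose labels are obvious.
train  <- matrix(c(0, 0, 0, 1, 1, 0, 10, 10, 10, 11, 11, 10), ncol = 2, byrow = TRUE)
labels <- factor(c("ok", "ok", "ok", "fault", "fault", "fault"))
test   <- matrix(c(0.5, 0.5, 10.5, 10.5), ncol = 2, byrow = TRUE)
knn(train = train, test = test, cl = labels, k = 3)  # "ok" then "fault"
```

On the actual split, `table(Predicted = knn(train.data, test.data, train.labels, k = 10), Actual = test.labels)` gives the confusion matrix.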
library(vegan)   # isomap()
library(vegan3d) # rgl.isomap()
data_sampled <- data[sample(1:nrow(data), 100), ]
dist_iso <- dist(data_sampled[, 2:7], method = 'euclidean') # input features only
dom <- factor(data_sampled$Output..S.)
iso <- isomap(dist_iso, ndim=3, k=10)
rgl.isomap(iso, col = as.numeric(dom))
## Warning in ordirgl(x, ...): set isometric aspect ratio, previously was 0.855,
## 0.865, 1.837
There are two main approaches to hierarchical clustering, namely top-down and bottom-up. In the first, we initially form one huge cluster containing all the data and then divide it into smaller clusters; in the second, we start with a cluster for each data point and then merge them into bigger clusters.
There are different methods to define the inter-cluster distance; the ones used here are listed below:
Max link method: the distance between two clusters is the maximum distance over all pairs of data points, one from each cluster.
Min link method: the distance between two clusters is the minimum distance over all pairs of data points, one from each cluster.
Mean (average) method: the distance is the average over all pairs of distances between the clusters.
Ward's minimum variance method: merges the pair of clusters that minimizes the total within-cluster distance (intra-cluster variance).
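These criteria correspond to the `method` argument of `hclust()` ('complete', 'single', 'average', 'ward.D2'). A small self-contained comparison on toy points of my own shows that the linkages agree on the grouping but report different merge heights:

```r
# Two tight pairs of points: all linkages pair them the same way, but the
# height of the final merge differs by linkage criterion.
pts <- matrix(c(0, 0, 0, 1, 5, 5, 5, 6), ncol = 2, byrow = TRUE)
d <- dist(pts)
for (m in c("complete", "single", "average", "ward.D2")) {
  hc <- hclust(d, method = m)
  cat(m, "final merge height:", round(max(hc$height), 2), "\n")
}
```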
The dataset is subsampled so that the tree to show the hierarchical clustering and the true labels are visible, and to keep the run time feasible. Below one can see the true labels and the clusters into which they are assigned, so they have an understanding of the clustering performance.
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
data_sampled <- data[sample(1:nrow(data), 100), ]
# Normalize the input columns only (column-wise), leaving out the label.
data_normalized <- as.data.frame(lapply(data_sampled[, 2:7], normalize))
dist_mat_n <- dist(data_normalized, method = 'euclidean')
hc_max <- hclust(dist_mat_n, method = 'complete')
hc_max$labels <- data_sampled$Output..S.
plot(hc_max, cex = 0.6)
rect.hclust(hc_max, k = 2, border = 2:5)
hc_min <- hclust(dist_mat_n, method = 'single')
hc_min$labels <- data_sampled$Output..S.
plot(hc_min, cex = 0.6)
rect.hclust(hc_min, k = 2, border = 2:5)
hclust_avg <- hclust(dist_mat_n, method = 'average')
hclust_avg$labels <- data_sampled$Output..S.
plot(hclust_avg, cex = 0.6)
rect.hclust(hclust_avg, k = 2, border = 2:5)
hc_ward <- hclust(dist_mat_n, method = 'ward.D2') # 'ward' is deprecated in favour of 'ward.D'/'ward.D2'
hc_ward$labels <- data_sampled$Output..S.
plot(hc_ward, cex = 0.6)
rect.hclust(hc_ward, k = 2, border = 2:5)
library(cluster)
dv <- diana(data_sampled)
dv$order.lab <- data_sampled$Output..S.
pltree(dv, cex = 0.6, hang = -1, main = "Dendrogram of diana")
rect.hclust(as.hclust(dv), k = 2, border = 2:5) # diana objects must be converted for rect.hclust()
In the graphs above, the Ward link and max link methods seem to outperform the others; that is, the remaining methods either fail to cluster distinctly or are more susceptible to outliers (for instance, the min link method produces a cluster containing only a single instance). I will therefore move on with the Ward link and max link methods.
Since I already know this is a binary classification problem, I choose k = 2, i.e., I cut the tree at the second level from the root.
For inclusivity of the whole dataset, I subsampled it several times and ran the algorithm on each sample.
The following code snippet visualizes the clusters formed using the max link method:
library(factoextra)
hc_max <- hclust(dist_mat_n, method = 'complete')
hc_max$labels <- data_sampled$Output..S.
clust <- cutree(hc_max, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
set.seed(123)
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
library(factoextra)
hc_max <- hclust(dist_mat_n, method = 'complete')
hc_max$labels <- data_sampled$Output..S.
clust <- cutree(hc_max, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
set.seed(124) # a different seed, so this subsample differs from the previous run
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
library(factoextra)
hc_max <- hclust(dist_mat_n, method = 'complete')
hc_max$labels <- data_sampled$Output..S.
clust <- cutree(hc_max, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
# set.seed(123)
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
library(factoextra)
hc_max <- hclust(dist_mat_n, method = 'complete')
hc_max$labels <- data_sampled$Output..S.
clust <- cutree(hc_max, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
The following code snippet visualizes the clusters formed using the Ward link method:
hc_ward <- hclust(dist_mat_n, method = 'ward.D2') # 'ward' is deprecated in favour of 'ward.D'/'ward.D2'
hc_ward$labels <- data_sampled$Output..S.
clust <- cutree(hc_ward, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
set.seed(123)
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
hc_ward <- hclust(dist_mat_n, method = 'ward.D2')
hc_ward$labels <- data_sampled$Output..S.
clust <- cutree(hc_ward, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
set.seed(124) # a different seed, so this subsample differs from the previous run
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
hc_ward <- hclust(dist_mat_n, method = 'ward.D2')
hc_ward$labels <- data_sampled$Output..S.
clust <- cutree(hc_ward, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
set.seed(125) # a different seed, so this subsample differs from the previous runs
data_sampled <- data[sample(1:nrow(data), 100), ]
data_normalized <- normalize(data_sampled)
dist_mat_n <- dist(data_normalized, method = 'euclidean')
hc_ward <- hclust(dist_mat_n, method = 'ward.D2')
hc_ward$labels <- data_sampled$Output..S.
clust <- cutree(hc_ward, k = 2)
fviz_cluster(list(data = data_sampled, cluster = clust))
Although there are overlaps within the clusters formed by Ward's link hierarchical clustering, on average it seems to outperform the max link results. Furthermore, the hierarchical clustering algorithms form clusters consistent with those obtained by the PCA and MDS methods, in that two clusters, although not fully distinct, can be observed.
As this is a binary classification problem, there are supposed to be 2 clusters; hence k is initially chosen to be 2. Moreover, k-means starts from random initial clusters, and although it is guaranteed to converge, it might converge to a local rather than the global minimum. That is why the algorithm is run multiple times with different initial configurations, each run is evaluated, and the configuration producing the best performance, i.e., the smallest value of the objective function, is kept. The nstart argument of the kmeans function is the number of random initializations tried.
Moreover, subsampling has been applied here as well, for the same reasons stated above.
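The effect of nstart can be seen on synthetic data (a toy example of my own, not the fault dataset): single random starts reach varying objective values, while many restarts keep the best one.

```r
# Two Gaussian blobs, deliberately over-clustered with k = 4 so that several
# local minima exist; compare objective values across restarts.
set.seed(1)
X <- rbind(matrix(rnorm(100), ncol = 2), matrix(rnorm(100, mean = 6), ncol = 2))
objs <- replicate(20, kmeans(X, centers = 4, nstart = 1)$tot.withinss)
best <- kmeans(X, centers = 4, nstart = 50)$tot.withinss
range(objs)  # spread across single restarts
best         # the multi-start result sits at or near the low end of that range
```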
set.seed(123)
data_sampled1 <- data[sample(1:nrow(data), 100), ]
k2 <- kmeans(data_sampled1, centers = 2, nstart = 50)
fviz_cluster(k2, data = data_sampled1)
set.seed(124) # a different seed, so this subsample differs from the previous run
data_sampled1 <- data[sample(1:nrow(data), 100), ]
k2 <- kmeans(data_sampled1, centers = 2, nstart = 50)
fviz_cluster(k2, data = data_sampled1)
Normally k is a hyperparameter that needs to be provided, and when it is unknown it is tricky to find. One of the common methods for finding a reasonable k is the elbow method, in which we choose the k where an "elbow" occurs in the loss curve. The elbow choice is somewhat ambiguous, since the curve decreases at every k: as k grows, the total loss necessarily decreases, and in the extreme case we would form a cluster for each element, which is obviously not meaningful. We therefore look for the k at which the loss has already decreased drastically and the curve flattens out. The following graphs show the loss versus the number of clusters.
I subsampled multiple times to obtain the graphs showing the optimal number of clusters:
set.seed(123)
fviz_nbclust(data_sampled1, kmeans, method = "wss")
set.seed(123)
data_sampled1 <- data[sample(1:nrow(data), 100), ]
fviz_nbclust(data_sampled1, kmeans, method = "wss")
set.seed(124) # a different seed, so this subsample differs from the previous run
data_sampled1 <- data[sample(1:nrow(data), 100), ]
fviz_nbclust(data_sampled1, kmeans, method = "wss")
After subsampling multiple times, all of the graphs obtained above seem consistent.
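For reference, the curve drawn by fviz_nbclust(..., method = "wss") is simply the total within-cluster sum of squares reported by kmeans; here is a self-contained base-R sketch on toy data of my own (not the fault dataset):

```r
# Total within-cluster sum of squares for k = 1..6 on two separated blobs:
# the elbow appears at the true number of clusters (2 here).
set.seed(1)
X <- rbind(matrix(rnorm(60), ncol = 2), matrix(rnorm(60, mean = 8), ncol = 2))
wss <- sapply(1:6, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)
plot(1:6, wss, type = "b", xlab = "Number of clusters k", ylab = "Total WSS")
```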
The elbow seems most pronounced at 6 clusters, and k = 6 might actually be a meaningful choice: the task can be refined from binary classification, i.e., determining whether the system is faulty, to multi-class classification, i.e., determining where the fault is. I described both problems for this dataset in the second assignment (preliminary analysis) but proceeded with the binary one. The following excerpt, taken from my second assignment, defines the multi-class classification problem:
"After detecting there is a fault, the main problem is to identify where it is. There are 3 phases (A, B, and C), and the Ground (Gnd) in the system whose intervals are where the problem might occur. Hence there 6 different possibilities (there are 3 intervals) regarding where the error is. More explicitly:
[G C B A]
[0 0 0 0] - No Fault
[1 0 0 1] - LG fault (Between Phase A and Gnd)
[0 0 1 1] - LL fault (Between Phase A and Phase B)
[1 0 1 1] - LLG Fault (Between Phases A,B and ground)
[0 1 1 1] - LLL Fault(Between all three phases)
[1 1 1 1] - LLLG fault( Three phase symmetrical fault)"
As can be seen, there are six possible cases, and hence this is a 6-class classification problem.
k = 3 seems to be a candidate for an elbow, so it is my next choice for k.
k2 <- kmeans(data_sampled, centers = 3, nstart = 50)
fviz_cluster(k2, data = data_sampled)
As explained above, one of the problems defined for this dataset is a 6-class classification problem, and k = 6 also looks feasible by the elbow method; that is why the next choice is k = 6.
k2 <- kmeans(data_sampled, centers = 6, nstart = 50)
fviz_cluster(k2, data = data_sampled)
Although the clusters formed are not distinctly separated in 2D, they are insightful for both the k = 2 and k = 6 choices.
These clustering graphs, too, look as if they would be more meaningful in a 3D space; I was not able to visualize them in 3D, yet the 3D-like structure they form suggests consistency with the PCA and MDS results.
library(dbscan)
set.seed(123)
data_sampled2 <- data[sample(1:nrow(data), 1000), ] #sub sampling
The R documentation suggests setting minPts "equal to the dimensionality of the data plus one or higher"; with 6 input features, 6-7 is suggested. However, in a manual parameter search I found 4 to be the best option, which is not far from 6. The documentation suggests using kNNdistplot() to choose the eps value: a sudden increase in the kNN distance indicates that the points to its right are most likely outliers. Inspecting the graph below, the sudden increase lies between 0 and 70; after searching this range manually, the best value found for eps is 15. The DBSCAN documentation is at https://www.rdocumentation.org/packages/dbscan/versions/1.1-8/topics/dbscan.
minPtsParam = 7
kNNdistplot(data_sampled2, minPtsParam)
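What kNNdistplot() draws is, for each point, the distance to its k-th nearest neighbour, sorted increasingly; a base-R sketch of that quantity (the helper name is my own):

```r
# Sorted k-th nearest-neighbour distances; the "knee" of this curve is the
# usual heuristic for choosing eps.
knn_dist <- function(X, k) {
  D <- as.matrix(dist(X))
  # for each point, sort its distances and skip the first (self, distance 0)
  sort(apply(D, 1, function(row) sort(row)[k + 1]))
}
```

For instance, `knn_dist(data_sampled2, minPtsParam)` reproduces the values behind the plot above.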
### Chosen parameters: eps = 15, minPts = 7
cluster_dbscanChosen <- dbscan(data_sampled2, eps = 15, minPts = 7)
fviz_cluster(cluster_dbscanChosen, data = data_sampled2)
cluster_dbscan <- dbscan(data_sampled2, eps = 50, minPts = minPtsParam)
fviz_cluster(cluster_dbscan, data = data_sampled2)
cluster_dbscan <- dbscan(data_sampled2, eps = 70, minPts = minPtsParam)
fviz_cluster(cluster_dbscan, data = data_sampled2)
I will move on with the "best" clusterings presented above for external evaluation, i.e., hierarchical clustering with Ward's link (cut at k = 2) and 2-means clustering.
Using all data without subsampling:
data_normalized <- normalize(data[,2:ncol(data)])
dist_mat_n <- dist(data_normalized, method = 'euclidean')
## adjusted Rand index for Ward's link
library(ClusterR) # provides external_validation()
hc_ward <- hclust(dist_mat_n, method = 'ward.D2')
clusters <- cutree(hc_ward, k = 2)
true_labels <- factor(data$Output..S., labels = c(1,2))
res1 = external_validation(as.numeric(true_labels), clusters, method = "adjusted_rand_index")
res1
## [1] 0.2922942
The adjusted Rand index for this clustering is well above zero, i.e., clearly better than chance, so one can regard it as a reasonably good clustering.
k2 <- kmeans(data, centers = 2, nstart = 50)
res1 = external_validation(as.numeric(true_labels), k2$cluster, method = "adjusted_rand_index")
res1
## [1] 0.2393556
The adjusted Rand index for 2-means clustering is noticeably lower than that of Ward's link method. This alone suggests that Ward's link outperformed 2-means on this data; further comparison and analysis follow.
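For reference, the adjusted Rand index that external_validation() reports can be computed directly from the contingency table of the two labelings; a minimal sketch (the function name is my own):

```r
# Adjusted Rand index: agreement between two labelings, corrected for chance.
adj_rand <- function(a, b) {
  tab <- table(a, b)
  comb2 <- function(x) x * (x - 1) / 2  # "n choose 2", vectorized
  sum_ij <- sum(comb2(tab))
  sum_a <- sum(comb2(rowSums(tab)))
  sum_b <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(length(a))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}
adj_rand(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: identical partitions up to relabeling
```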
First, I will evaluate clusterings with 2 and 6 clusters, as those are the meaningful counts for this data.
set.seed(123)
data_sampled1 <- data[sample(1:nrow(data), 1000), ]
library(clValid) # provides clValid()
intern <- clValid(data_sampled1, nClust = c(2, 6), clMethods = c("hierarchical"), validation = "internal", method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
summary(intern)
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 6
##
## Validation Measures:
## 2 6
##
## hierarchical Connectivity 4.1944 31.8734
## Dunn 0.1695 0.2067
## Silhouette 0.5047 0.6007
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 4.1944 hierarchical 2
## Dunn 0.2067 hierarchical 6
## Silhouette 0.6007 hierarchical 6
As can be seen from the summary of the clustering evaluation, neither cluster count is dominantly "superior to" the other, which again makes sense given the semantics of the data, i.e., both binary and 6-class classification problems are defined over this dataset.
set.seed(123)
data_sampled1 <- data[sample(1:nrow(data), 1000), ]
clmethods <- c("hierarchical","kmeans")
intern <- clValid(data_sampled1, nClust = c(2, 6), clMethods = clmethods, validation = "internal", method="ward")
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
## The "ward" method has been renamed to "ward.D"; note new "ward.D2"
summary(intern)
##
## Clustering Methods:
## hierarchical kmeans
##
## Cluster sizes:
## 2 6
##
## Validation Measures:
## 2 6
##
## hierarchical Connectivity 4.1944 31.8734
## Dunn 0.1695 0.2067
## Silhouette 0.5047 0.6007
## kmeans Connectivity 8.8020 26.6619
## Dunn 0.1304 0.1562
## Silhouette 0.5006 0.6305
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 4.1944 hierarchical 2
## Dunn 0.2067 hierarchical 6
## Silhouette 0.6305 kmeans 6
Connectivity should be minimized, whereas the other measures should be maximized. The connectivity scores suggest that a larger number of clusters yields poorer clustering, with almost identical values for hierarchical and k-means clustering.
According to the Dunn index, a larger number of clusters yields better clustering for both methods. At 6 clusters the performance of hierarchical and k-means is almost the same, which suggests that 6 is a feasible choice for the number of clusters, consistent with the defined problem and the previous analyses.
This analysis does not clearly show which method outperformed the other. Based on the internal indices one might say k-means is slightly (not dominantly) better, which, however, is inconsistent with what the adjusted Rand index evaluation showed.
The clValid function does not accept DBSCAN as a method, so the silhouette index is computed separately. The average silhouette value is 0.3. The silhouette index ranges between -1 and 1, and a value of 0.3, being clearly above 0, makes this result acceptable. Since the silhouette index accounts for both separation (inter-cluster distance) and cohesion (intra-cluster distance), it is a reasonably informative metric on its own.
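As a sanity check of the metric itself on toy data of my own: two well separated groups should score close to 1.

```r
library(cluster)  # silhouette()
pts <- rbind(matrix(c(0, 0, 0, 1), ncol = 2, byrow = TRUE),
             matrix(c(20, 20, 20, 21), ncol = 2, byrow = TRUE))
sil <- silhouette(c(1, 1, 2, 2), dist(pts))
mean(sil[, "sil_width"])  # close to 1 for well separated groups
```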
library(dplyr)
sil.dbscan <- silhouette(cluster_dbscanChosen$cluster, dist(data_sampled2))
library(RColorBrewer)
clusterColours <- brewer.pal(9,"Set1")
# Colour bars by the DBSCAN assignments (+1 so noise points, cluster 0, also get a colour)
plot(sil.dbscan, border=NA, col=clusterColours[sort(cluster_dbscanChosen$cluster) + 1], main="")
Using the external metric, which is the more reliable evaluation since we have the ground-truth labels, we can conclude that hierarchical clustering with Ward's link clearly outperforms the other methods. The internal indices, although they do not directly suggest the opposite, might have been misleading had we not had the true labels.
Given the consistent results of all the steps so far, the data does appear to form clusters. That is, when a new instance arrives, one can project it onto the dimensions used in these plots, assess whether it falls into the "possibly faulty" region, and act accordingly, which was the whole purpose from the beginning.
[1] Prakash, E. S. (2021, May 22). Electrical fault detection and classification. Kaggle. Retrieved November 30, 2021, from https://www.kaggle.com/esathyaprakash/electrical-fault-detection-and-classification?select=detect_dataset.csv.
[2] Comprehensive R Archive Network (CRAN). (2021, July 20). Multidimensional scaling [R package smacof version 2.1-3]. The Comprehensive R Archive Network. Retrieved November 30, 2021, from https://cran.r-project.org/web/packages/smacof/.